{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CH4 数据源获取（食材准备）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1 外部数据导入"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.1 Excel数据导入"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在导入Excel时，要注意：如果使用文件路径的\\，必须要在前面加一个r表示转义符<br>\n",
    "如果不希望加，也可以用/来替代"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男 2018-08-08\n",
       "1  A2  16  女 2018-08-09\n",
       "2  A3  47  女 2018-08-10\n",
       "3  A4  41  男 2018-08-11"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(r\"C:\\Users\\ZHOU YUFENG\\Documents\\Python Scripts\\对比Excel学习Python\\Excel-Python-Learning-Note\\test.xlsx\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男 2018-08-08\n",
       "1  A2  16  女 2018-08-09\n",
       "2  A3  47  女 2018-08-10\n",
       "3  A4  41  男 2018-08-11"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于Excel有一个Sheet的概念，若要导入Excel有多个Sheet，需明确指明Sheet具体名或序号，以免出现混乱<br>\n",
    "具体的参数为sheet_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男 2018-08-08\n",
       "1  A2  16  女 2018-08-09\n",
       "2  A3  47  女 2018-08-10\n",
       "3  A4  41  男 2018-08-11"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\", sheet_name = \"Sheet1\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男 2018-08-08\n",
       "1  A2  16  女 2018-08-09\n",
       "2  A3  47  女 2018-08-10\n",
       "3  A4  41  男 2018-08-11"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\", sheet_name = 0)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在导入时，行索引也是默认从0开始依次往后，但也可以直接指定具体哪个字段作为索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>编号</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>A1</th>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A2</th>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A3</th>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>A4</th>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    年龄 性别       注册时间\n",
       "编号                  \n",
       "A1  54  男 2018-08-08\n",
       "A2  16  女 2018-08-09\n",
       "A3  47  女 2018-08-10\n",
       "A4  41  男 2018-08-11"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\", sheet_name = 0, \n",
    "                   index_col = 0)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在导入时，默认第一行作为列索引，也可以用header方法手动设置列索引<br>\n",
    "当然，如果数据集本身就不带有标题，则header = None，最后再用df.columns = [\"  \", \"  \", \"  \" ]传入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男 2018-08-08\n",
       "1  A2  16  女 2018-08-09\n",
       "2  A3  47  女 2018-08-10\n",
       "3  A4  41  男 2018-08-11"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\", sheet_name = 0,\n",
    "                   header = 0)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>编号</td>\n",
       "      <td>年龄</td>\n",
       "      <td>性别</td>\n",
       "      <td>注册时间</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-08 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-09 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-08-10 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-08-11 00:00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2                    3\n",
       "0  编号  年龄  性别                 注册时间\n",
       "1  A1  54   男  2018-08-08 00:00:00\n",
       "2  A2  16   女  2018-08-09 00:00:00\n",
       "3  A3  47   女  2018-08-10 00:00:00\n",
       "4  A4  41   男  2018-08-11 00:00:00"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\", sheet_name = 0,\n",
    "                   header = None)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在导入时，如果字段比较多，但只需要部分列时，可以仅选取部分列，用usecols参数来实现，可以以列表形式传递多个值,同样从0开始计数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>性别</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>男</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>女</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>女</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>男</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号 性别\n",
       "0  A1  男\n",
       "1  A2  女\n",
       "2  A3  女\n",
       "3  A4  男"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\", usecols = [0,2])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.2 CSV数据导入"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas读取csv的方法类似，但还是需要注意如下：<br>\n",
    "1.默认读取是以逗号分隔的，如果是空格，需要人为指定<br>\n",
    "2.默认的编码格式是utf-8，如果不是，则需要将编码格式改为encoding = \"gbk\"<br>\n",
    "3.如果路径中有汉字，需要添加engine = \"python\"<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男   2018/8/8\n",
       "1  A2  16  女   2018/8/9\n",
       "2  A3  47  女  2018/8/10\n",
       "3  A4  41  男  2018/8/11"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"test.csv\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男   2018/8/8\n",
       "1  A2  16  女   2018/8/9\n",
       "2  A3  47  女  2018/8/10\n",
       "3  A4  41  男  2018/8/11"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"test1.csv\",sep = \" \")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果仅需要读取几行数据，也可添加参数nrows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别      注册时间\n",
       "0  A1  54  男  2018/8/8\n",
       "1  A2  16  女  2018/8/9"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"test.csv\", nrows = 2)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.3 TXT数据导入"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "txt可以用read_table()方法导入，这个方法比较万能，除了txt之外，scv也可以导入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男   2018/8/8\n",
       "1  A2  16  女   2018/8/9\n",
       "2  A3  47  女  2018/8/10\n",
       "3  A4  41  男  2018/8/11"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_table(\"test.txt\", sep = \" \")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男   2018/8/8\n",
       "1  A2  16  女   2018/8/9\n",
       "2  A3  47  女  2018/8/10\n",
       "3  A4  41  男  2018/8/11"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_table(\"test.csv\", sep = \",\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.4 SQL数据导入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import create_engine\n",
    "engine = create_engine('mysql+pymysql://root:Aa111111@localhost:3306/db')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\Anaconda3\\lib\\site-packages\\pymysql\\cursors.py:170: Warning: (1366, \"Incorrect string value: '\\\\xD6\\\\xD0\\\\xB9\\\\xFA\\\\xB1\\\\xEA...' for column 'VARIABLE_VALUE' at row 519\")\n",
      "  result = self._query(query)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018/8/10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018/8/11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男   2018/8/8\n",
       "1  A2  16  女   2018/8/9\n",
       "2  A3  47  女  2018/8/10\n",
       "3  A4  41  男  2018/8/11"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sql = \"SELECT * FROM Python_learning\"\n",
    "df = pd.read_sql(sql, engine)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3熟悉数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在分析数据前，一定要对数据有充分的认识，至少要了解如下几点：<br>\n",
    "1.数据的结构是怎么样的<br>\n",
    "2.每个字段的类型是什么<br>\n",
    "3.数据的大小是多少（行数×列数）<br>\n",
    "4.数值型字段的分布情况，文本型字段的计数情况<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3.1宏观了解"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一般在查看数据时，用head()方法显示哪些行，默认展示前5行,不过也可以手动在括号中输入自己需要的行数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dict = {\"编号\":[\"A1\",\"A2\",\"A3\",\"A4\",\"A3\",\"A4\",\"A2\"],\"年龄\":[54,16,47,41,47,41,16],\n",
    "        \"性别\":[\"男\",\"女\",\"女\",\"男\",\"女\",\"男\",\"女\"],\n",
    "        \"注册时间\":[\"2018-8-8\",\"2018-8-9\",\"2018-8-10\",\"2018-8-11\",\"2018-8-10\",\"2018-8-11\",\"2018-8-9\"]}\n",
    "df = pd.DataFrame(data_dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-8-8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-8-9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-8-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A4</td>\n",
       "      <td>41</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-8-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A3</td>\n",
       "      <td>47</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-8-10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别       注册时间\n",
       "0  A1  54  男   2018-8-8\n",
       "1  A2  16  女   2018-8-9\n",
       "2  A3  47  女  2018-8-10\n",
       "3  A4  41  男  2018-8-11\n",
       "4  A3  47  女  2018-8-10"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>编号</th>\n",
       "      <th>年龄</th>\n",
       "      <th>性别</th>\n",
       "      <th>注册时间</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1</td>\n",
       "      <td>54</td>\n",
       "      <td>男</td>\n",
       "      <td>2018-8-8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2</td>\n",
       "      <td>16</td>\n",
       "      <td>女</td>\n",
       "      <td>2018-8-9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   编号  年龄 性别      注册时间\n",
       "0  A1  54  男  2018-8-8\n",
       "1  A2  16  女  2018-8-9"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "查看完数据的整体结构后，再使用shape方法来查看行数×列数的情况<br>\n",
    "但有一点必须要注意的是：Excel在统计时，会把行索引与列索引一起算入<br>\n",
    "而Python则不会"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4, 4)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\")\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为不同类型的数据有着不同的数据方法，所以必须用info()要明确各个字段的类型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 4 entries, 0 to 3\n",
      "Data columns (total 4 columns):\n",
      "编号      4 non-null object\n",
      "年龄      4 non-null int64\n",
      "性别      4 non-null object\n",
      "注册时间    4 non-null datetime64[ns]\n",
      "dtypes: datetime64[ns](1), int64(1), object(2)\n",
      "memory usage: 208.0+ bytes\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过上面的info方法，可以得到如下信息： <br>\n",
    "1.行索引index是0-3<br>\n",
    "2.列索引（字段）共有4个：<br>\n",
    "分别是，①编号 ②年龄 ③性别 ④注册时间<br>\n",
    "年龄为int型，注册时间为datetime型，其余均为object类型<br>\n",
    "3.共占用208bytes<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3.2微观了解"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果变量为数值型变量，则可以查看分布情况<br>\n",
    "describe()方法可以查看数据中所有数据型变量的分布"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>年龄</th>\n",
       "      <th>收入</th>\n",
       "      <th>家庭人数</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>4.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>25.750000</td>\n",
       "      <td>7250.000000</td>\n",
       "      <td>2.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>4.349329</td>\n",
       "      <td>1707.825128</td>\n",
       "      <td>0.57735</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>20.000000</td>\n",
       "      <td>5000.000000</td>\n",
       "      <td>2.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>23.750000</td>\n",
       "      <td>6500.000000</td>\n",
       "      <td>2.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>26.500000</td>\n",
       "      <td>7500.000000</td>\n",
       "      <td>2.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>28.500000</td>\n",
       "      <td>8250.000000</td>\n",
       "      <td>3.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>30.000000</td>\n",
       "      <td>9000.000000</td>\n",
       "      <td>3.00000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              年龄           收入     家庭人数\n",
       "count   4.000000     4.000000  4.00000\n",
       "mean   25.750000  7250.000000  2.50000\n",
       "std     4.349329  1707.825128  0.57735\n",
       "min    20.000000  5000.000000  2.00000\n",
       "25%    23.750000  6500.000000  2.00000\n",
       "50%    26.500000  7500.000000  2.50000\n",
       "75%    28.500000  8250.000000  3.00000\n",
       "max    30.000000  9000.000000  3.00000"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([[20,5000,2],[25,8000,3],[30,9000,3],[28,7000,2]],\n",
    "                 columns = [\"年龄\",\"收入\",\"家庭人数\"])\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "文本型变量则采用用value_counts方法，添加参数dropna = False，这样可以观察是否存在NA值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "男    2\n",
       "女    2\n",
       "Name: 性别, dtype: int64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_excel(\"C:/Users/ZHOU YUFENG/Documents/Python Scripts/对比Excel学习Python/Excel-Python-Learning-Note/test.xlsx\")\n",
    "df['性别'].value_counts(dropna=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
